DATABASE ANNOTATION AND RETRIEVAL

The present invention relates to the annotation of data
files which are to be stored in a database, for
facilitating their subsequent retrieval. The present
invention is also concerned with a system for generating
the annotation data which is added to the data file and
with a system for searching the annotation data in the
database to retrieve a desired data file in response to
a user's input query.

Databases of information are well known and suffer from 
the problem of how to locate and retrieve the desired 
information from the database quickly and efficiently. 
Existing database search tools allow the user to search 
the database using typed keywords. Whilst this is quick 
and efficient, this type of searching is not suitable for 
various kinds of databases, such as video or audio 
databases.

According to one aspect, the present invention aims to
provide a data structure for annotating data files
within a database, which allows a quick and efficient
search to be carried out in response to a user's input
query.

According to one aspect, the present invention provides
data defining a phoneme and word lattice for use as
annotation data for annotating data files to be stored
within a database. Preferably, the data defines a
plurality of nodes within the lattice and a plurality of
links connecting the nodes within the lattice; further
data associates a plurality of phonemes with a respective
plurality of links; and further data associates at least
one word with at least one of said links.
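
The lattice described above can be pictured as a small graph
structure. The following is a minimal sketch only, assuming
Python dataclasses; the class names, field names and time
values are hypothetical and are not taken from the
specification.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    target: int   # index of the node this link extends to
    label: str    # a phoneme such as "t" or a word such as "tell"
    kind: str     # "phoneme" or "word"


@dataclass
class Node:
    time_offset: float                          # hypothetical unit: seconds
    links: list = field(default_factory=list)   # outgoing links


@dataclass
class Lattice:
    nodes: list = field(default_factory=list)

    def add_node(self, time_offset):
        self.nodes.append(Node(time_offset))
        return len(self.nodes) - 1

    def add_link(self, source, target, label, kind):
        self.nodes[source].links.append(Link(target, label, kind))

    def labels(self, kind):
        # All labels of the requested kind, in node order.
        return [link.label for node in self.nodes
                for link in node.links if link.kind == kind]


# Toy lattice: phoneme alternatives /t/ or /d/, then /eh/ /l/,
# with two word links spanning the same nodes.
lattice = Lattice()
n0, n1, n2, n3 = (lattice.add_node(t) for t in (0.0, 0.1, 0.2, 0.3))
lattice.add_link(n0, n1, "t", "phoneme")
lattice.add_link(n0, n1, "d", "phoneme")
lattice.add_link(n1, n2, "eh", "phoneme")
lattice.add_link(n2, n3, "l", "phoneme")
lattice.add_link(n0, n3, "tell", "word")
lattice.add_link(n0, n3, "dell", "word")
```

Storing phoneme links and word links on the same nodes is
what lets a search move between the word level and the
phoneme level for the same stretch of audio.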



WO 00/54168 



PCT/GB00/00718



According to another aspect, the present invention
provides a method of searching a database comprising the
annotation data discussed above, in response to an input
query by a user. The method preferably comprises the
steps of generating phoneme data and word data
corresponding to the user's input query; searching the
database using the word data corresponding to the user's
query; selecting a portion of the data defining the
phoneme and word lattice in the database for further
searching in response to the results of the word search;
searching said selected portion of the database using
said phoneme data corresponding to the user's input
query; and outputting the search results.

According to this aspect, the present invention also
provides an apparatus for searching a database which
employs the annotation data discussed above for
annotating data files therein. The apparatus preferably
comprises means for generating phoneme data and word data
corresponding to a user's input query; means for
searching the database using the word data corresponding
to the user's input query to identify similar words
within the database; means for selecting a portion of the
annotation data in the database for further searching in
response to the results of the word search; means for
searching the selected portion using the phoneme data
corresponding to the user's input query; and means for
outputting the search results.

The phoneme and annotation data for a data file may be
generated from the data file itself or from a typed or
spoken annotation input by the user.

Exemplary embodiments of the present invention will now
be described with reference to the accompanying figures,
in which:

Figure 1 is a schematic view of a computer which is
programmed to operate an embodiment of the present
invention;

Figure 2 is a block diagram showing a phoneme and word
annotator unit which is operable to generate phoneme and
word annotation data for appendage to a data file;

Figure 3 is a block diagram illustrating one way in which
the phoneme and word annotator can generate the
annotation data from an input video data file;

Figure 4a is a schematic diagram of a phoneme lattice for
an example audio string from the input video data file;

Figure 4b is a schematic diagram of a word and phoneme
lattice embodying one aspect of the present invention,
for an example audio string from the input video data
file;

Figure 5 is a schematic block diagram of a user's
terminal which allows the user to retrieve information
from the database by a voice query;

Figure 6a is a flow diagram illustrating part of the flow
control of the user terminal shown in Figure 5;

Figure 6b is a flow diagram illustrating the remaining
part of the flow control of the user terminal shown in
Figure 5;

Figure 7 is a flow diagram illustrating the way in which
a search engine forming part of the user's terminal
carries out a phoneme search within the database;

Figure 8 is a schematic diagram illustrating a phoneme
string and four M-GRAMS generated from the phoneme
string;

Figure 9 is a plot showing two vectors and the angle
between the two vectors;

Figure 10 is a schematic diagram of a pair of word and
phoneme lattices, for example audio strings from two
speakers;

Figure 11 is a schematic block diagram of a user terminal
which allows the annotation of a data file with
annotation data generated from an audio signal input
from a user;

Figure 12 is a schematic diagram of phoneme and word
lattice annotation data which is generated for an
utterance input by the user for annotating a data file;

Figure 13 is a schematic block diagram of a user terminal
which allows the annotation of a data file with
annotation data generated from a typed input from a
user;

Figure 14 is a schematic diagram of phoneme and word
lattice annotation data which is generated for a typed
input by the user for annotating a data file;

Figure 15 is a block schematic diagram showing the form
of a document annotation system;

Figure 16 is a block schematic diagram of an alternative
document annotation system;

Figure 17 is a block schematic diagram of another
document annotation system;

Figure 18 is a schematic block diagram illustrating a
user terminal which is operable to access a database
located on a remote server via a data network in response
to an input utterance by the user;

Figure 19 is a schematic block diagram of a user terminal
which allows a user to access a database located in a
remote server in response to an input utterance from the
user;

Figure 20 is a schematic block diagram of a user terminal
which allows a user to access a database by a typed input
query; and

Figure 21 is a schematic block diagram illustrating the
way in which a phoneme and word lattice can be generated
from script data contained within a video data file.

Embodiments of the present invention can be implemented
using dedicated hardware circuits, but the embodiment to
be described is implemented in computer software or code,
which is run in conjunction with processing hardware such
as a personal computer, work station, photocopier,
facsimile machine, personal digital assistant (PDA) or
the like.

Figure 1 shows a personal computer (PC) 1 which is
programmed to operate an embodiment of the present
invention. A keyboard 3, a pointing device 5, a
microphone 7 and a telephone line 9 are connected to the
PC 1 via an interface 11. The keyboard 3 and pointing
device 5 enable the system to be controlled by a user.
The microphone 7 converts acoustic speech signals from
the user into equivalent electrical signals and supplies
them to the PC 1 for processing. An internal modem and
speech receiving circuit (not shown) is connected to the
telephone line 9 so that the PC 1 can communicate with,
for example, a remote computer or with a remote user.

The programme instructions which make the PC 1 operate in
accordance with the present invention may be supplied for
use with an existing PC 1 on, for example, a storage
device such as a magnetic disc 13, or by downloading the
software from the Internet (not shown) via the internal
modem and telephone line 9.

DATA FILE ANNOTATION

Figure 2 is a block diagram illustrating the way in which
annotation data 21 for an input data file 23 is generated
in this embodiment by a phoneme and word annotating unit
25. As shown, the generated phoneme and word annotation
data 21 is then combined with the data file 23 in the
data combination unit 27 and the combined data file
output thereby is input to the database 29. In this
embodiment, the annotation data 21 comprises a combined
phoneme (or phoneme like) and word lattice which allows
the user to retrieve information from the database by a
voice query. As those skilled in the art will
appreciate, the data file 23 can be any kind of data
file, such as a video file, an audio file, a multimedia
file etc.

A system has been proposed to generate N-Best word lists
for an audio stream as annotation data by passing the
audio data from a video data file through an automatic
speech recognition unit. However, such word-based
systems suffer from a number of problems. These include
(i) that state of the art speech recognition systems
still make basic mistakes in recognition; (ii) that state
of the art automatic speech recognition systems use a
dictionary of perhaps 20,000 to 100,000 words and cannot
produce words outside that vocabulary; and (iii) that the
production of N-Best lists grows exponentially with the
number of hypotheses at each stage, therefore resulting
in the annotation data becoming prohibitively large for
long utterances.

The first of these problems may not be that significant
if the same automatic speech recognition system is used
to generate the annotation data and to subsequently
retrieve the corresponding data file, since the same
decoding error could occur. However, with advances in
automatic speech recognition systems being made each
year, it is likely that in the future the same type of
error may not occur, resulting in the inability to
retrieve the corresponding data file at that
later date. With regard to the second problem, this is
particularly significant in video data applications,
since users are likely to use names and places (which may
not be in the speech recognition dictionary) as input
query terms. In place of these names, the automatic
speech recognition system will typically replace the out
of vocabulary words with a phonetically similar word or
words within the vocabulary, often corrupting nearby
decodings. This can also result in the failure to
retrieve the required data file upon subsequent request.

In contrast, with the proposed phoneme and word lattice
annotation data, a quick and efficient search using the
word data in the database 29 can be carried out and, if
this fails to provide the required data file, then a
further search using the more robust phoneme data can be
performed. The phoneme and word lattice is an acyclic
directed graph with a single entry point and a single
exit point. It represents different parses of the audio
stream within the data file. It is not simply a sequence
of words with alternatives, since each word does not have
to be replaced by a single alternative: one word can be
substituted for two or more words or phonemes, and the
whole structure can form a substitution for one or more
words or phonemes. Therefore, the density of data within
the phoneme and word lattice essentially remains linear
throughout the audio data, rather than growing
exponentially as in the case of the N-Best technique
discussed above. As those skilled in the art of speech
recognition will realise, the use of phoneme data is more
robust, because phonemes are dictionary independent and
allow the system to cope with out of vocabulary words
such as names, places, foreign words etc. The use of
phoneme data is also capable of making the system future
proof, since it allows data files which are placed into
the database to be retrieved even when the words were not
understood by the original automatic speech recognition
system.
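
The difference between exponential and linear growth can be
illustrated with a short calculation. This is an idealised
sketch, assuming every stage of the utterance has the same
number of alternative hypotheses; the function names are
hypothetical.

```python
def n_best_entries(alternatives, stages):
    # Every combination of alternatives is listed as a separate hypothesis.
    return alternatives ** stages


def lattice_links(alternatives, stages):
    # Each stage only stores its own alternatives as parallel links.
    return alternatives * stages


for stages in (5, 10, 20):
    print(stages, n_best_entries(2, stages), lattice_links(2, stages))
```

With two alternatives per stage, twenty stages give 1,048,576
listed hypotheses under this model but only 40 lattice links.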

The way in which this phoneme and word lattice annotation
data can be generated for a video data file will now be
described with reference to Figure 3. As shown, the
video data file 31 comprises video data 31-1, which
defines the sequence of images forming the video sequence
and audio data 31-2, which defines the audio which is
associated with the video sequence. As is well known,
the audio data 31-2 is time synchronised with the video
data 31-1 so that, in use, both the video and audio data
are supplied to the user at the same time.

As shown in Figure 3, in this embodiment, the audio data
31-2 is input to an automatic speech recognition unit 33,
which is operable to generate a phoneme lattice
corresponding to the stream of audio data 31-2. Such an
automatic speech recognition unit 33 is commonly
available in the art and will not be described in further
detail. The reader is referred to, for example, the book
entitled 'Fundamentals of Speech Recognition' by Lawrence
Rabiner and Biing-Hwang Juang and, in particular, to
pages 42 to 50 thereof, for further information on this
type of speech recognition system.

Figure 4a illustrates the form of the phoneme lattice
data output by the speech recognition unit 33, for the
input audio corresponding to the phrase '...tell me about
Jason...'. As shown, the automatic speech recognition
unit 33 identifies a number of different possible phoneme
strings which correspond to this input audio utterance.
For example, the speech recognition system considers that
the first phoneme in the audio string is either a /t/ or
a /d/. As is well known in the art of speech
recognition, these different possibilities can have their
own weighting which is generated by the speech
recognition unit 33 and is indicative of the confidence
of the speech recognition unit's output. For example,
the phoneme /t/ may be given a weighting of 0.9 and the
phoneme /d/ may be given a weighting of 0.1, indicating
that the speech recognition system is fairly confident
that the corresponding portion of audio represents the
phoneme /t/, but that it still may be the phoneme /d/.
In this embodiment, however, this weighting of the
phonemes is not performed.

As shown in Figure 3, the phoneme lattice data 35 output
by the automatic speech recognition unit 33 is input to
a word decoder 37 which is operable to identify possible
words within the phoneme lattice data 35. In this
embodiment, the words identified by the word decoder 37
are incorporated into the phoneme lattice data structure.
For example, for the phoneme lattice shown in Figure 4a,
the word decoder 37 identifies the words 'tell', 'dell',
'term', 'me', 'a', 'boat', 'about', 'chase' and 'sun'.
As shown in Figure 4b, these identified words are added
to the phoneme lattice data structure output by the
speech recognition unit 33, to generate a phoneme and
word lattice data structure which forms the annotation
data 31-3. This annotation data 31-3 is then combined
with the video data file 31 to generate an augmented
video data file 31' which is then stored in the database
29. As those skilled in the art will appreciate, in a
similar way to the way in which the audio data 31-2 is
time synchronised with the video data 31-1, the
annotation data 31-3 is also time synchronised and
associated with the corresponding video data 31-1 and
audio data 31-2, so that a desired portion of the video
and audio data can be retrieved by searching for and
locating the corresponding portion of the annotation data
31-3.
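
The specification does not say how the word decoder 37 finds
words in the phoneme lattice. Purely as an illustration, the
sketch below assumes a simple pronunciation dictionary lookup
along lattice paths; the function name, the tuple layout of
the links and the pronunciations are all hypothetical.

```python
from collections import defaultdict


def find_word_links(phoneme_links, pronunciations):
    """Return (start_node, end_node, word) for every dictionary word
    whose phoneme sequence can be traced along the lattice links."""
    outgoing = defaultdict(list)
    for start, end, phoneme in phoneme_links:
        outgoing[start].append((end, phoneme))

    def end_nodes(node, remaining):
        # Nodes reachable from `node` by following exactly `remaining`.
        if not remaining:
            return {node}
        found = set()
        for nxt, phoneme in outgoing[node]:
            if phoneme == remaining[0]:
                found |= end_nodes(nxt, remaining[1:])
        return found

    starts = {start for start, _, _ in phoneme_links}
    word_links = set()
    for word, phonemes in pronunciations.items():
        for start in starts:
            for end in end_nodes(start, tuple(phonemes)):
                word_links.add((start, end, word))
    return word_links


# Toy lattice with the /t/ or /d/ alternative described in the text.
phoneme_links = [(0, 1, "t"), (0, 1, "d"), (1, 2, "eh"), (2, 3, "l")]
pronunciations = {"tell": ("t", "eh", "l"), "dell": ("d", "eh", "l"),
                  "me": ("m", "iy")}
word_links = find_word_links(phoneme_links, pronunciations)
```

Each returned triple corresponds to a word link that could be
added between two existing nodes of the phoneme lattice.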






In this embodiment, the annotation data 31-3 stored in
the database 29 has the following general form:

HEADER
  - time of start
  - flag if word if phoneme if mixed
  - time index associating the location of blocks of
    annotation data within memory to a given time point
  - word set used (i.e. the dictionary)
  - phoneme set used
  - the language to which the vocabulary pertains

Block(i) i = 0, 1, 2, ...
  node N_j j = 0, 1, 2, ...
    - time offset of node from start of block
    - phoneme links (k) k = 0, 1, 2, ...
        offset to node N_j = N_k - N_j (N_k is the node to
        which link k extends) or, if N_k is in block (i+1),
        offset to node N_j = N_k + N_b - N_j (where N_b is
        the number of nodes in block (i))
        phoneme associated with link (k)
    - word links (l) l = 0, 1, 2, ...
        offset to node N_j = N_k - N_j (N_k is the node to
        which link l extends) or, if N_k is in block (i+1),
        offset to node N_j = N_k + N_b - N_j (where N_b is
        the number of nodes in block (i))
        word associated with link (l)

The time of start data in the header can identify the
time and date of transmission of the data. For example,
if the video file is a news broadcast, then the time of
start may include the exact time of the broadcast and the
date on which it was broadcast.

The flag identifying if the annotation data is word
annotation data, phoneme annotation data or if it is
mixed is provided since not all the data files within the
database will include the combined phoneme and word
lattice annotation data discussed above, and in this
case, a different search strategy would be used to search
this annotation data.

In this embodiment, the annotation data is divided into
blocks in order to allow the search to jump into the
middle of the annotation data for a given audio data
stream. The header therefore includes a time index which
associates the location of the blocks of annotation data
within the memory to a given time offset between the time
of start and the time corresponding to the beginning of
the block.
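
A time index of this kind can be searched with a binary
search. The following is a minimal sketch, assuming the index
is a sorted list of (time offset, memory location) pairs; the
layout, the units and the function name are hypothetical.

```python
from bisect import bisect_right


def locate_block(time_index, query_offset):
    """Return the memory location of the block covering `query_offset`.

    time_index: list of (time_offset_from_start, memory_location),
    sorted by time offset, one entry per block.
    """
    offsets = [offset for offset, _ in time_index]
    position = bisect_right(offsets, query_offset) - 1
    if position < 0:
        raise ValueError("query time precedes the first block")
    return time_index[position][1]


# Three blocks starting 0 s, 15 s and 30 s after the time of start.
time_index = [(0.0, 0), (15.0, 4096), (30.0, 9120)]
location = locate_block(time_index, 17.5)
```

Here a query 17.5 s into the stream resolves to the second
block without reading the first one.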

The header also includes data defining the word set used 
(i.e. the dictionary), the phoneme set used and the 
language to which the vocabulary pertains. The header 
may also include details of the automatic speech 
recognition system used to generate the annotation data 
and any appropriate settings thereof which were used 
during the generation of the annotation data. 

The blocks of annotation data then follow the header and
identify, for each node in the block, the time offset of
the node from the start of the block, the phoneme links
which connect that node to other nodes by phonemes and
word links which connect that node to other nodes by
words. Each phoneme link and word link identifies the
phoneme or word which is associated with the link. They
also identify the offset to the current node. For
example, if node N50 is linked to node N55 by a phoneme
link, then the offset to node N50 is 5. As those skilled
in the art will appreciate, using an offset indication
like this allows the division of the continuous
annotation data into separate blocks.
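
The offset rule given in the general form of the annotation
data can be sketched as a pair of helper functions. This is
an illustration only: it assumes node numbers restart at zero
in each block and that a link reaches at most the following
block, and the function names are hypothetical.

```python
def encode_offset(source, target, nodes_in_block, target_in_next_block=False):
    """Offset stored on a link from node `source` to node `target`."""
    if target_in_next_block:
        # N_k + N_b - N_j when the target node lies in block (i+1).
        return target + nodes_in_block - source
    # N_k - N_j when both nodes lie in the same block.
    return target - source


def decode_offset(source, offset, nodes_in_block):
    """Return (target_node, target_is_in_next_block)."""
    position = source + offset
    if position >= nodes_in_block:
        return position - nodes_in_block, True
    return position, False


same_block = encode_offset(50, 55, nodes_in_block=60)
next_block = encode_offset(58, 2, nodes_in_block=60, target_in_next_block=True)
```

Because every link is stored relative to its own node, a
block can be read on its own without knowing absolute node
numbers from earlier blocks.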

In an embodiment where an automatic speech recognition
unit outputs weightings indicative of the confidence of
the speech recognition unit's output, these weightings or
confidence scores would also be included within the data
structure. In particular, a confidence score would be
provided for each node which is indicative of the
confidence of arriving at the node and each of the
phoneme and word links would include a transition score
depending upon the weighting given to the corresponding
phoneme or word. These weightings would then be used to
control the search and retrieval of the data files by
discarding those matches which have a low confidence
score.

DATA FILE RETRIEVAL

Figure 5 is a block diagram illustrating the form of a
user terminal 59 which can be used to retrieve the
annotated data files from the database 29. This user
terminal 59 may be, for example, a personal computer,
hand held device or the like. As shown, in this
embodiment, the user terminal 59 comprises the database
29 of annotated data files, an automatic speech
recognition unit 51, a search engine 53, a control unit
55 and a display 57. In operation, the automatic speech
recognition unit 51 is operable to process an input voice
query from the user 39 received via the microphone 7 and
the input line 61 and to generate therefrom corresponding
phoneme and word data. This data may also take the form
of a phoneme and word lattice, but this is not essential.
This phoneme and word data is then input to the control
unit 55 which is operable to initiate an appropriate
search of the database 29 using the search engine 53.
The results of the search, generated by the search engine
53, are then transmitted back to the control unit 55
which analyses the search results and generates and
displays appropriate display data to the user via the
display 57.

Figures 6a and 6b are flow diagrams which illustrate the
way in which the user terminal 59 operates in this
embodiment. In step s1, the user terminal 59 is in an
idle state and awaits an input query from the user 39.
Upon receipt of an input query, the phoneme and word data
for the input query is generated in step s3 by the
automatic speech recognition unit 51. The control unit
55 then instructs the search engine 53, in step s5, to
perform a search in the database 29 using the word data
generated for the input query. The word search employed
in this embodiment is the same as is currently being used
in the art for typed keyword searches, and will not be
described in more detail here. If in step s7, the
control unit 55 identifies from the search results that
a match for the user's input query has been found, then
it outputs the search results to the user via the display
57.

In this embodiment, the user terminal 59 then allows the
user to consider the search results and awaits the user's
confirmation as to whether or not the results correspond
to the information the user requires. If they do, then
the processing proceeds from step s11 to the end of the
processing and the user terminal 59 returns to its idle
state and awaits the next input query. If, however, the
user indicates (by, for example, inputting an appropriate
voice command) that the search results do not correspond
to the desired information, then the processing proceeds
from step s11 to step s13, where the search engine 53
performs a phoneme search of the database 29. However,
in this embodiment, the phoneme search performed in step
s13 is not of the whole database 29, since this could
take several hours depending on the size of the database
29.

Instead, the phoneme search performed in step s13 uses
the results of the word search performed in step s5 to
identify one or more portions within the database which
may correspond to the user's input query. The way in
which the phoneme search performed in step s13 is
performed in this embodiment will be described in more
detail later. After the phoneme search has been
performed, the control unit 55 identifies, in step s15,
if a match has been found. If a match has been found,
then the processing proceeds to step s17 where the
control unit 55 causes the search results to be displayed
to the user on the display 57. Again, the system then
awaits the user's confirmation as to whether or not the
search results correspond to the desired information. If
the results are correct, then the processing passes from
step s19 to the end and the user terminal 59 returns to
its idle state and awaits the next input query. If,
however, the user indicates that the search results do
not correspond to the desired information, then the
processing proceeds from step s19 to step s21, where the
control unit 55 is operable to ask the user, via the
display 57, whether or not a phoneme search should be
performed of the whole database 29. If in response to
this query, the user indicates that such a search should
be performed, then the processing proceeds to step s23
where the search engine performs a phoneme search of the
entire database 29.

On completion of this search, the control unit 55
identifies, in step s25, whether or not a match for the
user's input query has been found. If a match is found,
then the processing proceeds to step s27 where the
control unit 55 causes the search results to be displayed
to the user on the display 57. If the search results are
correct, then the processing proceeds from step s29 to
the end of the processing and the user terminal 59
returns to its idle state and awaits the next input
query. If, on the other hand, the user indicates that the
search results still do not correspond to the desired
information, then the processing passes to step s31 where
the control unit 55 queries the user, via the display 57,
whether or not the user wishes to redefine or amend the
search query. If the user does wish to redefine or amend
the search query, then the processing returns to step s3
where the user's subsequent input query is processed in
a similar manner. If the search is not to be redefined
or amended, then the search results and the user's
initial input query are discarded and the user terminal
59 returns to its idle state and awaits the next input
query.
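
The escalation from word search to restricted phoneme search
to full phoneme search can be summarised in code. This is a
sketch under stated assumptions rather than the flow charts
of Figures 6a and 6b: the four callables and their signatures
are hypothetical, and the handling of stages that find no
match at all is assumed, since the text only spells out what
happens when the user rejects the results.

```python
def staged_search(query, word_search, phoneme_search, confirm,
                  allow_full_search):
    """Return the results the user accepted, or None.

    word_search(words) -> list of candidate portions
    phoneme_search(phonemes, portions) -> list of results;
        portions=None means the whole database
    confirm(results) -> True if the user accepts the results
    allow_full_search() -> True if the user agrees to a full search
    """
    # Word search on the recognised words (steps s5 to s11).
    word_hits = word_search(query["words"])
    if word_hits and confirm(word_hits):
        return word_hits

    # Phoneme search limited to portions from the word search (s13 to s19).
    if word_hits:
        hits = phoneme_search(query["phonemes"], word_hits)
        if hits and confirm(hits):
            return hits

    # Phoneme search of the entire database, only on request (s21 to s29).
    if allow_full_search():
        hits = phoneme_search(query["phonemes"], None)
        if hits and confirm(hits):
            return hits

    # Nothing accepted: the caller may ask for an amended query.
    return None
```

Passing the search and confirmation steps in as callables
keeps the control logic separate from the search engine and
from the user interface.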



PHONEME SEARCH

As mentioned above, in steps s13 and s23, the search
engine 53 compares the phoneme data of the input query
with the phoneme data in the phoneme and word lattice
annotation data stored in the database 29. Various
techniques can be used, including standard pattern
matching techniques such as dynamic programming, to carry
out this comparison. In this embodiment, a technique
which we refer to as M-GRAMS is used. This technique was
proposed by Ng, K. and Zue, V.W. and is discussed in, for
example, the paper entitled "Subword unit representations
for spoken document retrieval" published in the
proceedings of Eurospeech 1997.

The problem with searching for individual phonemes is
that there will be many occurrences of each phoneme
within the database. Therefore, an individual phoneme on
its own does not provide enough discriminability to be
able to match the phoneme string of the input query with
the phoneme strings within the database. Syllable sized
units, however, are likely to provide more
discriminability, although they are not easy to identify.

The M-GRAM technique presents a suitable compromise
between these two possibilities and takes overlapping
fixed size fragments, or M-GRAMS, of the phoneme string
to provide a set of features. This is illustrated in
Figure 8, which shows part of an input phoneme string
having phonemes a, b, c, d, e, and f, which are split
into four M-GRAMS (a, b, c), (b, c, d), (c, d, e) and (d,
e, f). In this illustration, each of the four M-GRAMS
comprises a sequence of three phonemes which is unique
and represents a unique feature (f_i) which can be found
within the input phoneme string.
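
Splitting a phoneme string into overlapping fixed size
fragments takes only a few lines. The sketch below is
illustrative; the function name and the default fragment
size of three are assumptions based on the example of
Figure 8.

```python
def m_grams(phonemes, m=3):
    """Return the overlapping fragments of length m, in order."""
    return [tuple(phonemes[i:i + m]) for i in range(len(phonemes) - m + 1)]


fragments = m_grams(["a", "b", "c", "d", "e", "f"])
```

A string shorter than the fragment size yields no fragments
at all, so very short queries would need separate handling.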

Therefore, referring to Figure 7, the first step s51 in
performing the phoneme search in step s13 shown in Figure
6, is to identify all the different M-GRAMS which are in
the input phoneme data and their frequency of occurrence.
Then, in step s53, the search engine 53 determines the
frequency of occurrence of the identified M-GRAMS in the
selected portion of the database (identified from the
word search performed in step s5 in Figure 6). To
illustrate this, for a given portion of the database and
for the example M-GRAMS illustrated in Figure 8, this
yields the following table of information:



M-GRAM          Input phoneme string   Phoneme string of
(feature f_i)   frequency of           selected portion
                occurrence (g)         of database (a)

M1              1                      0
M2              2                      2
M3              3                      2
M4              1                      1

Next, in step s55, the search engine 53 calculates a
similarity score representing a similarity between the
phoneme string of the input query and the phoneme string
of the selected portion from the database. In this
embodiment, this similarity score is determined using a
cosine measure using the frequencies of occurrence of the
identified M-GRAMS in the input query and in the selected
portion of the database as vectors. The philosophy
behind this technique is that if the input phoneme string
is similar to the selected portion of the database
phoneme string, then the frequency of occurrence of the
M-GRAM features will be similar for the two phoneme
strings. Therefore, if the frequencies of occurrence of
the M-GRAMS are considered to be vectors (i.e.
considering the second and third columns in the above
table as vectors), then if there is a similarity between
the input phoneme string and the selected portion of the
database, then the angle between these vectors should be
small. This is illustrated in Figure 9 for two-
dimensional vectors a and g, with the angle between the
vectors given as θ. In the example shown in Figure 8,
the vectors a and g will be four dimensional vectors and
the similarity score can be calculated from:



$$\text{score} = \cos\theta = \frac{\mathbf{a} \cdot \mathbf{g}}{|\mathbf{a}|\,|\mathbf{g}|}$$

This score is then associated with the current selected 
portion of the database and stored until the end of the 
search. In some applications, the vectors used in the 
calculation of the cosine measure will be the logarithm 
of these frequencies of occurrences, rather than the 
frequencies of occurrences themselves. 
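
As a worked example, the cosine measure can be evaluated for
the two frequency columns of the table above. This is a
minimal sketch; the function name is hypothetical, and
returning zero for an all-zero vector is an assumption, since
the text does not cover that case.

```python
import math


def cosine_score(u, v):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


query_frequencies = [1, 2, 3, 1]      # input phoneme string column
portion_frequencies = [0, 2, 2, 1]    # selected database portion column
score = cosine_score(query_frequencies, portion_frequencies)
# score = 11 / (sqrt(15) * 3), approximately 0.947
```

A score close to 1 means the two strings use the same
M-GRAMS in similar proportions.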
The processing then proceeds to step s57 where the search
engine 53 identifies whether or not there are any more
selected portions of phoneme strings from the database
29. If there are, then the processing returns to step
s53 where a similar procedure is followed to identify the
score for this portion of the database. If there are no
more selected portions, then the searching ends and the
processing returns to step s15 shown in Figure 6, where
the control unit considers the scores generated by the
search engine 53 and identifies whether or not there is
a match by, for example, comparing the calculated scores
with a predetermined threshold value.
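
The loop over selected portions followed by the threshold
test can be sketched end to end. This is an illustration
under assumptions: the portions are given as plain phoneme
lists rather than lattices, the threshold of 0.5 is an
arbitrary placeholder, and all names are hypothetical.

```python
import math
from collections import Counter


def m_gram_counts(phonemes, m=3):
    # Frequency of occurrence of every overlapping fragment of length m.
    return Counter(tuple(phonemes[i:i + m])
                   for i in range(len(phonemes) - m + 1))


def similarity(query_counts, portion_counts):
    # Only M-GRAMS identified in the query are used as features.
    features = list(query_counts)
    g = [query_counts[f] for f in features]
    a = [portion_counts.get(f, 0) for f in features]
    dot = sum(x * y for x, y in zip(g, a))
    norm = math.sqrt(sum(x * x for x in g)) * math.sqrt(sum(y * y for y in a))
    return dot / norm if norm else 0.0


def find_matches(query, portions, threshold=0.5, m=3):
    """Score every selected portion and keep those above the threshold."""
    query_counts = m_gram_counts(query, m)
    scores = {name: similarity(query_counts, m_gram_counts(phonemes, m))
              for name, phonemes in portions.items()}
    return {name: s for name, s in scores.items() if s >= threshold}


portions = {"p1": list("xxabcdefyy"), "p2": list("zzzzzz")}
matches = find_matches(list("abcdef"), portions)
```

In this toy run only the portion containing the query string
survives the threshold.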

As those skilled in the art will appreciate, a similar
matching operation will be performed in step s23 shown in
Figure 6. However, since the entire database is being
searched, this search is carried out by searching each of
the blocks discussed above in turn.

ALTERNATIVE EMBODIMENTS

As those skilled in the art will appreciate, this type of
phonetic and word annotation of data files in a database
provides a convenient and powerful way to allow a user to
search the database by voice. In the illustrated
embodiment, a single audio data stream was annotated and
stored in the database for subsequent retrieval by the
user. As those skilled in the art will appreciate, when
the input data file corresponds to a video data file, the
audio data within the data file will usually include
audio data for different speakers. Instead of generating
a single stream of annotation data for the audio data,
separate phoneme and word lattice annotation data can be
generated for the audio data of each speaker. This may
be achieved by identifying, from the pitch or from
another distinguishing feature of the speech signals, the
audio data which corresponds to each of the speakers and
then by annotating each speaker's audio
separately. This may also be achieved if the audio data
was recorded in stereo or if an array of microphones was
used in generating the audio data, since it is then
possible to process the audio data to extract the data
for each speaker.

Figure 10 illustrates the form of the annotation data in
such an embodiment, where a first speaker utters the
words "... this so" and the second speaker replies "yes".
As illustrated, the annotation data for the different
speakers' audio data are time synchronised, relative to
each other, so that the annotation data is still time
synchronised to the video and audio data within the data
file. In such an embodiment, the header information in
the data structure should preferably include a list of
the different speakers within the annotation data and,
for each speaker, data defining that speaker's language,
accent, dialect and phonetic set, and each block should
identify those speakers that are active in the block.
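
By way of illustration only, the header and block information
described above might be held in a structure such as the
following Python sketch. The class and field names are
hypothetical and are not taken from the described data
structure. The sketch only shows one way of listing the
speakers in the header and of recording, for each block,
which speakers are active.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Speaker:
    name: str
    language: str
    accent: str
    dialect: str
    phonetic_set: str


@dataclass
class Header:
    speakers: List[Speaker]          # every speaker in the annotation data


@dataclass
class Block:
    start_time: float                # offset from the start of the data file
    active_speakers: List[int]       # indices into Header.speakers
    nodes: list = field(default_factory=list)   # lattice nodes (left empty here)


@dataclass
class Annotation:
    header: Header
    blocks: List[Block]

    def blocks_for(self, speaker_index):
        """Return the blocks in which the given speaker is active."""
        return [b for b in self.blocks if speaker_index in b.active_speakers]
```

Recording the active speakers per block allows a search
restricted to one speaker to skip blocks in which that
speaker does not talk.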

In the above embodiments, a speech recognition system was
used to generate the annotation data for annotating a
data file in the database. As those skilled in the art
will appreciate, other techniques can be used to generate
this annotation data. For example, a human operator can
listen to the audio data and generate a phonetic and word
transcription to thereby manually generate the annotation
data.

In the above embodiments, the annotation data was
generated from audio stored in the data file itself. As
those skilled in the art will appreciate, other
techniques can be used to input the annotation data.
Figure 11 illustrates the form of a user terminal 59
which allows a user to input voice annotation data via
the microphone 7 for annotating a data file 91 which is
to be stored in the database 29. In this embodiment, the
data file 91 comprises a two dimensional image generated
by, for example, a camera. The user terminal 59 allows
the user 39 to annotate the 2D image with an appropriate
annotation which can be used subsequently for retrieving
the 2D image from the database 29. In this embodiment,
the input voice annotation signal is converted, by the
automatic speech recognition unit 51, into phoneme and
word lattice annotation data which is passed to the
control unit 55. In response to the user's input, the
control unit 55 retrieves the appropriate 2D file from
the database 29 and appends the phoneme and word
annotation data to the data file 91. The augmented data
file is then returned to the database 29. During this
annotating step, the control unit 55 is operable to
display the 2D image on the display 57 so that the user
can ensure that the annotation data is associated with
the correct data file 91.

The automatic speech recognition unit 51 generates the
phoneme and word lattice annotation data by (i)
generating a phoneme lattice for the input utterance;
(ii) then identifying words within the phoneme lattice;
and (iii) finally by combining the two. Figure 12
illustrates the form of the phoneme and word lattice
annotation data generated for the input utterance
"picture of the Taj-Mahal". As shown, the automatic
speech recognition unit identifies a number of different
possible phoneme strings which correspond to this input
utterance. As shown in Figure 12, the words which the
automatic speech recognition unit 51 identifies within
the phoneme lattice are incorporated into the phoneme
lattice data structure. As shown, for the example phrase,
the automatic speech recognition unit 51 identifies the
words "picture", "of", "off", "the", "other", "ta",
"tar", "jam", "ah", "hal", "ha" and "al". The control
unit 55 is then operable to add this annotation data to
the 2D image data file 91 which is then stored in a
database 29.
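
By way of illustration only, steps (i) to (iii) above might be
sketched in Python as follows. The sketch makes several
simplifying assumptions: the phoneme lattice is a single path
rather than a set of alternative phoneme strings, the words
are found by exact lookup in a small hypothetical
pronunciation dictionary, and the class names, field names and
phoneme symbols are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Link:
    start: int      # index of the node the link leaves
    end: int        # index of the node the link enters
    label: str      # the phoneme or the word carried by the link
    kind: str       # "phoneme" or "word"


@dataclass
class Lattice:
    num_nodes: int
    links: List[Link] = field(default_factory=list)


def build_lattice(phonemes, dictionary):
    # (i) one phoneme link between each pair of consecutive nodes
    lattice = Lattice(num_nodes=len(phonemes) + 1)
    for i, phoneme in enumerate(phonemes):
        lattice.links.append(Link(i, i + 1, phoneme, "phoneme"))
    # (ii) find every span of phonemes that matches a dictionary entry and
    # (iii) add a word link between the nodes at either end of that span
    for word, pron in dictionary.items():
        for i in range(len(phonemes) - len(pron) + 1):
            if phonemes[i:i + len(pron)] == pron:
                lattice.links.append(Link(i, i + len(pron), word, "word"))
    return lattice
```

Because word links and phoneme links share the same nodes, a
word such as "hal" and the shorter word "ha" can both leave
the same node, which is how overlapping word hypotheses
coexist in one lattice.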

As those skilled in the art will appreciate, this
embodiment can be used to annotate any kind of image such
as x-rays of patients, 3D videos of, for example, NMR
scans, ultrasound scans etc. It can also be used to
annotate one-dimensional data, such as audio data.

In the above embodiment, a data file was annotated from
a voiced annotation. As those skilled in the art will
appreciate, other techniques can be used to input the
annotation. For example, Figure 13 illustrates the form
of a user terminal 59 which allows a user to input typed
annotation data via the keyboard 3 for annotating a data
file 91 which is to be stored in a database 29. In this
embodiment, the typed input is converted, by the phonetic
transcription unit 75, into the phoneme and word lattice
annotation data (using an internal phonetic dictionary
(not shown)) which is passed to the control unit 55. In
response to the user's input, the control unit 55
retrieves the appropriate 2D file from the database 29
and appends the phoneme and word annotation data to the
data file 91. The augmented data file is then returned
to the database 29. During this annotating step, the
control unit 55 is operable to display the 2D image on
the display 57 so that the user can ensure that the
annotation data is associated with the correct data file
91.

Figure 14 illustrates the form of the phoneme and word
lattice annotation data generated for the input utterance
"picture of the Taj-Mahal". As shown in Figure 2, the
phoneme and word lattice is an acyclic directed graph
with a single entry point and a single exit point. It
represents different parses of the user's input. As
shown, the phonetic transcription unit 75 identifies a
number of different possible phoneme strings which
correspond to the typed input.
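
By way of illustration only, the way in which a phonetic
dictionary can yield several possible phoneme strings for one
typed input might be sketched in Python as follows. The
function name, the dictionary layout (each word mapped to a
list of alternative pronunciations) and the phoneme symbols
are assumptions made for this example and do not describe the
phonetic transcription unit 75 itself.

```python
from itertools import product


def phoneme_strings(text, dictionary):
    """Return every phoneme string formed by picking one
    pronunciation for each word of the typed input."""
    # A word missing from the dictionary raises KeyError in this sketch
    options = [dictionary[word] for word in text.lower().split()]
    strings = []
    for choice in product(*options):
        # Concatenate the chosen pronunciation of each word in order
        strings.append([p for pron in choice for p in pron])
    return strings
```

Each returned string corresponds to one path from the entry
point to the exit point of the lattice. Words with two
pronunciations each therefore give four paths for a two word
input.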

Figure 15 is a block diagram illustrating a document
annotation system. In particular, as shown in Figure 15,
a text document 101 is converted into an image data file
by a document scanner 103. The image data file is then
passed to an optical character recognition (OCR) unit 105
which converts the image data of the document 101 into
electronic text. This electronic text is then supplied
to a phonetic transcription unit 107 which is operable to
generate phoneme and word annotation data 109 which is
then appended to the image data output by the scanner 103
to form a data file 111. As shown, the data file 111 is
then stored in the database 29 for subsequent retrieval.
In this embodiment, the annotation data 109 comprises the
combined phoneme and word lattice described above which
allows the user to subsequently retrieve the data file
111 from the database 29 by a voice query.

Figure 16 illustrates a modification to the document
annotation system shown in Figure 15. The difference
between the system shown in Figure 16 and the system
shown in Figure 15 is that the output of the optical
character recognition unit 105 is used to generate the
data file 113, rather than the image data output by the
scanner 103. The rest of the system shown in Figure 16
is the same as that shown in Figure 15 and will not be
described further.

Figure 17 shows a further modification to the document
annotation system shown in Figure 15. In the embodiment
shown in Figure 17, the input document is received by a
facsimile unit 115 rather than a scanner 103. The image
data output by the facsimile unit is then processed in
the same manner as the image data output by the scanner
103 shown in Figure 15, and will not be described again.

In the above embodiment, a phonetic transcription unit
107 was used for generating the annotation data for
annotating the image or text data. As those skilled in
the art will appreciate, other techniques can be used.
For example, a human operator can manually generate this
annotation data from the image of the document itself.

In the above embodiment, the database 29 and the
automatic speech recognition unit were both located
within the user terminal 59. As those skilled in the art
will appreciate, this is not essential. Figure 18
illustrates an embodiment in which the database 29 and
the search engine 53 are located in a remote server 60
and in which the user terminal 59 accesses and controls
data files in the database 29 via the network interface
units 67 and 69 and a data network 68 (such as the
Internet). In operation, the user inputs a voice query
via the microphone 7 which is converted into phoneme and
word data by the automatic speech recognition unit 51.
This data is then passed to the control unit 55 which
controls the transmission of this phoneme and word data
over the data network 68 to the search engine 53 located
within the remote server 60. The search engine 53 then
carries out the search in accordance with the received
phoneme and word data or controls the manipulation of
data files (for example to control the playing,
forwarding or rewinding of a video file) in accordance
with the received phoneme and word data. The data
retrieved from the database 29 or other data relating to
the search is then transmitted back, via the data network
68, to the control unit 55 which controls the display of
appropriate data on the display 57 for viewing by the
user 39. In this way it is possible to retrieve and
control data files in the remote server 60 without using
significant computer resources in the server (since it is
the user terminal 59 which converts the input speech into
the phoneme and word data).
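
By way of illustration only, the exchange between the user
terminal 59 and the remote server 60 might be sketched in
Python as follows. The JSON message layout, the function names
and the in-memory dictionary standing in for the database 29
are assumptions made for this example. For brevity the
server side matches on the word data only, although the
phoneme data is carried in the same message.

```python
import json


def encode_query(phonemes, words):
    """User terminal side: package the recogniser output for transmission."""
    message = {"phonemes": phonemes, "words": words}
    return json.dumps(message).encode("utf-8")


def handle_query(payload, database):
    """Server side: decode the query and return the names of the data
    files whose word annotation shares at least one word with it."""
    query = json.loads(payload.decode("utf-8"))
    wanted = set(query["words"])
    return [name for name, words in database.items() if wanted & set(words)]
```

Only the compact phoneme and word data crosses the network in
this arrangement, not the speech signal itself.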

In addition to locating the database 29 and the search
engine 53 in the remote server 60, it is also possible to
locate the automatic speech recognition unit 51 in the
remote server 60. Such an embodiment is shown in Figure
19. As shown in this embodiment, the input voice query
from the user is passed via input line 61 to a speech
encoding unit 73 which is operable to encode the speech
for efficient transfer through the data network 68. The
encoded data is then passed to the control unit 55 which
transmits the data over the network 68 to the remote
server 60, where it is processed by the automatic speech
recognition unit 51. The phoneme and word data generated
by the speech recognition unit 51 for the input query is
then passed to the search engine 53 for use in searching
and controlling data files in the database 29.
Appropriate data retrieved by the search engine 53 is
then passed, via the network interface 69 and the network
68, back to the user terminal 59. This data received
back from the remote server is passed via the network
interface unit 67 to the control unit 55 which generates
and displays appropriate data on the display 57 for
viewing by the user.

In the above embodiments, the user inputs his query by
voice. Figure 20 shows an alternative embodiment in
which the user inputs the query via the keyboard 3. As
shown, the text input via the keyboard 3 is passed to
phonetic transcription unit 75 which is operable to
generate a corresponding phoneme string from the input
text. This phoneme string together with the words input
via the keyboard 3 are then passed to the control unit 55
which initiates a search of the database using the search
engine 53. The way in which this search is carried out
is the same as in the first embodiment and will not,
therefore, be described again. As with the other
embodiments discussed above, the phonetic transcription
unit 75, search engine 53 and/or the database 29 may all
be located in a remote server.

In the first embodiment, the audio data from the data
file 31 was passed through an automatic speech
recognition unit in order to generate the phoneme
annotation data. In some situations, a transcript of the
audio data will be present in the data file. Such an
embodiment is illustrated in Figure 21. In this
embodiment, the data file 81 represents a digital video
file having video data 81-1, audio data 81-2 and script
data 81-3 which defines the lines for the various actors
in the video film. As shown, the script data 81-3 is
passed through a text to phoneme converter 83, which
generates phoneme lattice data 85 using a stored
dictionary which translates words into possible sequences
of phonemes. This phoneme lattice data 85 is then
combined with the script data 81-3 to generate the above
described phoneme and word lattice annotation data 81-4.
This annotation data is then added to the data file 81 to
generate an augmented data file 81' which is then added
to the database 29. As those skilled in the art will
appreciate, this embodiment facilitates the generation of
separate phoneme and word lattice annotation data for the
different speakers within the video data file, since the
script data usually contains indications of who is
talking. The synchronisation of the phoneme and word
lattice annotation data with the video and audio data can
then be achieved by performing a forced time alignment of
the script data with the audio data using an automatic
speech recognition system (not shown).
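
By way of illustration only, the generation of separate
annotation data for each speaker from script data might be
sketched in Python as follows. The representation of the
script as a list of (speaker, line) pairs, the function name
and the dictionary contents are assumptions made for this
example. The forced time alignment with the audio data is not
shown.

```python
from collections import defaultdict


def annotate_script(script, dictionary):
    """Group word and phoneme annotation by the speaker named in the script."""
    per_speaker = defaultdict(list)
    for speaker, line in script:
        for word in line.lower().split():
            # Words missing from the dictionary get an empty phoneme list
            per_speaker[speaker].append((word, dictionary.get(word, [])))
    return dict(per_speaker)
```

Each speaker's list could then be turned into its own phoneme
and word lattice, one lattice per speaker.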

In the above embodiments, a phoneme (or phoneme-like) and 
word lattice was used to annotate a data file. As those 
skilled in the art of speech recognition and speech 
processing will realise, the word "phoneme" in the 
description and claims is not limited to its linguistic 
meaning but includes the various sub-word units that are 
identified and used in standard speech recognition 
systems.

CLAIMS:

1. Data defining a phoneme and word lattice, the data
comprising:

data defining a plurality of nodes within the
lattice and a plurality of links connecting the nodes
within the lattice;

data associating a plurality of phonemes with a
respective plurality of links; and

data associating at least one word with at least one
of said links.

2. Data according to any preceding claim, wherein said
data defining said phoneme and word lattice is arranged
in blocks of nodes.

3. Data according to claim 1 or 2, further comprising
data defining time stamp information for each of said
nodes.

4. Data according to claim 3, arranged in blocks of
equal time duration.

5. Data according to claim 2, 3 or 4, further comprising
data defining each block's location within a database.

6. Data according to claim 3 or any claim dependent
thereon, wherein said data defining said phoneme and word
lattice is associated with further data defining a time
sequential signal, and wherein said time stamp
information is time synchronised with said time
sequential signal.

7. Data according to claim 6, wherein said further data
defines an audio and/or video signal.

8. Data according to claim 7, wherein said further data
defines at least speech data and wherein said data
defining said phoneme and word lattice is derived from
said further data.

9. Data according to claim 8, wherein said speech data
comprises audio data and wherein said data defining said 
phoneme and word lattice is derived by passing said audio 
signal through an automatic speech recognition system. 

10. Data according to claim 8 or 9, wherein said speech 
data defines the parol of a plurality of speakers, and 
wherein said data defines a separate phoneme and word 
lattice for the parol of each speaker. 

11. Data according to any preceding claim, further 
comprising data defining a weighting for the phonemes 
and/or words associated with said links. 

12. Data according to any preceding claim, wherein at 
least one of said nodes is connected to a plurality of 
other nodes by a plurality of links. 

13. Data according to claim 12, wherein at least one of 
said plurality of links connecting said node to said 
plurality of other nodes is associated with a phoneme and 
wherein at least one of said links connecting said node 
to said plurality of other nodes is associated with a 
word.

14. A method of searching a database comprising data
according to any preceding claim, in response to an input
query, the method comprising the steps of:

generating phoneme data and/or word data
corresponding to the input query;

searching the phoneme and word lattice using the
phoneme and/or word data generated for the input query;
and

outputting search results in dependence upon the
results of said searching step.

15. A method according to claim 14, wherein said
searching step comprises the steps of:

(i) searching the phoneme and word lattice using the
word data generated for the user's input query to
identify similar words within the phoneme and word
lattice;

(ii) selecting one or more portions of the phoneme
and word lattice for further searching in response to the
results of said word search; and

(iii) searching said one or more selected portions
of the phoneme and word lattice using said phoneme data
generated for the user's input query.

16. A method according to claim 15, wherein the results
of said word search are output to the user before said
phoneme search is performed on said selected portions of
the database.

17. A method according to claim 16, wherein said phoneme
search is only performed in response to a further input
by the user in response to the outputting of the results
from the word search.

18. A method according to any of claims 15 to 17,
wherein said phoneme search is performed by identifying
a number of features within the phoneme data
corresponding to the user's input query and identifying
similar features within the data defining said phoneme
lattice within the database.

19. A method according to claim 18, wherein each of said
features represents a unique sequence of phonemes within
the phoneme data of the user's input query.

20. A method according to claim 19, wherein said phoneme 
search employs a cosine measure to indicate the 
similarity between the phoneme data corresponding to the 
user's input query and the phoneme data within the 
database. 

21. A method according to any of claims 14 to 20, 
wherein said search results are output to a display. 

22. A method according to any of claims 14 to 21, 
wherein said input query by the user is input by voice, 
and wherein said step of generating phoneme data and word 
data employs an automatic speech recognition system. 

23. A method according to any of claims 14 to 21, 
wherein said input query is a typed input and wherein 
said step of generating phoneme data and word data 
employs a text-to-phoneme converter. 

24. An apparatus for searching a database comprising 
data according to any of claims 1 to 13, in response to 
an input query, the apparatus comprising: 

means for generating phoneme data and/or word data 
corresponding to the input query; 

means for searching the phoneme and word lattice 
using the phoneme and/or word data generated for the 
input query; and 

means for outputting search results in dependence 
upon the output from said searching means. 

25. An apparatus according to claim 24, wherein said
searching means comprises:

means for searching the phoneme and word lattice
using the word data generated for the user's input query
to identify similar words within the phoneme and word
lattice;

means for selecting one or more portions of the
phoneme and word lattice for further searching in
response to the results of said word search; and

means for searching said one or more selected
portions of the phoneme and word lattice using said
phoneme data generated for the user's input query.

26. An apparatus according to claim 25, wherein said
output means is operable to output the results of the
word search to the user before the phoneme search is
performed on said selected portions of the database.

27. An apparatus according to claim 26, wherein said
phoneme search is only performed in response to a further
input by the user in response to the outputting of the
results from the word search.

28. An apparatus according to any of claims 25 to 27,
wherein the phoneme search is performed by identifying a
number of features within the phoneme sequence
corresponding to the user's input query and identifying
similar features within the data defining said phoneme
lattice within the database.

29. An apparatus according to claim 28, wherein each of
said features represents a unique sequence of phonemes
within the phoneme data of the user's input query.

30. An apparatus according to claim 29, wherein said
phoneme search employs a cosine measure to indicate the
similarity between the phoneme data corresponding to the
user's input query and the phoneme data within the
database.

31. An apparatus according to any of claims 24 to 30,
wherein said output means comprises a display.

32. An apparatus according to any of claims 24 to 31,
wherein said input query by the user is a voice query,
and wherein said means for generating phoneme data and
word data comprises an automatic speech recognition
system which is operable to generate said phoneme data
and a word decoder which is operable to generate said
word data.

33. An apparatus according to any of claims 24 to 31,
wherein said input query is a typed query and wherein
said means for generating phoneme data and word data
comprises a text-to-phoneme converter which is operable
to generate said phoneme data.

34. An apparatus for generating annotation data for use
in annotating a data file comprising audio data, the
apparatus comprising:

an automatic speech recognition system for
generating phoneme data for audio data in the data file;

a word decoder for identifying possible words within
the phoneme data generated by the automatic speech
recognition system; and

generating means for generating annotation data by
combining the generated phoneme data and the decoded
words.

35. An apparatus for generating annotation data for use
in annotating a data file comprising text data, the
apparatus comprising:

a text to phoneme converter for generating phoneme
data for text data in the data file; and

generating means for generating annotation data by
combining the phoneme data and words in the text data.

36. An apparatus for generating annotation data for use
in annotating a data file, the apparatus comprising:

input means for receiving an input voice signal;

speech recognition means for converting the input
voice signal into phoneme data and words; and

generating means for generating annotation data by
combining the phoneme data and the words.

37. An apparatus for generating annotation data for use
in annotating a data file, the apparatus comprising:

input means for receiving a typed input from a user;

converting means for converting words in the typed
input into phoneme data; and

generating means for generating annotation data by
combining the phoneme data and words in the typed input.

38. An apparatus for generating annotation data for use
in annotating a data file, the apparatus comprising:

means for receiving image data representative of
text;

character recognition means for converting said
image data into text data;

converting means for converting words in the text
data into phoneme data; and

generating means for generating annotation data by
combining the phoneme data and words in the text data.

39. An apparatus according to any of claims 34 to 38,
wherein said annotation data defines a phoneme and word
lattice and wherein said generating means comprises:

(i) means for generating data defining a plurality
of nodes within the lattice and a plurality of links
connecting the nodes within the lattice;

(ii) means for generating data associating a
plurality of phonemes of the phoneme data with a
respective plurality of links; and

(iii) means for generating data associating at least
one of the words with at least one of said links.

40. An apparatus according to claim 39, wherein said
generating means is operable to generate said data
defining said phoneme and word lattice in blocks of said
nodes.

41. An apparatus according to claim 39 or 40, wherein
said generating means is operable to generate data
defining time stamp information for each of said nodes.

42. An apparatus according to claim 41, wherein said
generating means is arranged to generate said phoneme and
word lattice data in blocks of equal time duration.

43. An apparatus according to claim 40, 41 or 42,
wherein said generating means is operable to generate
data which defines each block's location within a
database.

44. An apparatus according to claim 41 or any claim 
dependent thereon, wherein said data file includes a time 
sequential signal, and wherein said generating means is 
operable to generate time stamp data which is time 
synchronised with said time sequential signal. 

45. An apparatus according to claim 44, wherein said

phoneme and wherein at least one of said links connecting
said node to said plurality of other nodes is associated
with a word.

52. An apparatus according to claim 36 or any claim
dependent thereon, wherein said speech recognition means
is operable to generate data defining a weighting for the
phonemes in the phoneme data.

53. An apparatus according to claim 52, wherein said
speech recognition means is operable to generate data
defining a weighting for the words within the word data.

54. An apparatus according to claim 36 or 37 or any
claim dependent thereon, further comprising means for
associating said annotation data with said data file.

55. An apparatus according to claim 37 or any claim
dependent thereon, wherein said converting means
comprises an automatic phonetic transcription unit which
generates said phoneme data from words within the typed
input.

56. An apparatus according to claim 38 or any claim 
dependent thereon, wherein said converting means 
comprises an automatic phonetic transcription unit which 
generates said phoneme data from words within the text 
data output by said character recognition means. 

57. An apparatus according to claim 38 or any claim 
dependent thereon, further comprising means for 
associating said annotation data with either said image 
data representative of said text or with said text data. 

58. An apparatus according to claim 38 or any claim

of:

receiving a typed input;

converting words in the typed input into phoneme
data; and

generating annotation data by combining the phoneme
data and words in the typed input.



63. A method of generating annotation data for use in
annotating a data file, the method comprising the steps
of:

receiving image data representative of text;

converting said image data into text data using a
character recognition unit;

converting words in the text data into phoneme data;
and

generating annotation data by combining the phoneme
data and words within the text data.

64. A method according to any of claims 59 to 63,
wherein said annotation data defines a phoneme and word
lattice and wherein said generating step comprises the
steps of:

(i) generating data defining a plurality of nodes
within the lattice and a plurality of links connecting
the nodes within the lattice;

(ii) generating data associating a plurality of
phonemes of the phoneme data with a respective plurality
of links; and

(iii) generating data associating at least one of
the words with at least one of said links.



65. A method according to claim 64, wherein said 
generating step generates said data defining said phoneme 
and word lattice in blocks of said nodes.

dependent thereon, wherein said speech recognition system
generates data defining a weighting for the phonemes
associated with said links.

74. A method according to claim 59 or any claim 
dependent thereon, wherein said word decoder generates 
data defining a weighting for the words associated with 
said links. 

75. A method according to claim 64 or any claim
dependent thereon, wherein said step of defining a 
plurality of nodes and a plurality of links defines at 
least one node which is connected to a plurality of other 
nodes by a plurality of links. 

76. A method according to claim 75, wherein at least one 
of said plurality of links connecting said node to said 
plurality of other nodes is associated with a phoneme and 
wherein at least one of said links connecting said node 
to said plurality of other nodes is associated with a 
word.

77. A method according to claim 61 or any claim 
dependent thereon, wherein said speech recognition system 
generates data defining a weighting for the phonemes 
associated with said links. 

78. A method according to claim 61 or any claim 
dependent thereon, wherein said speech recognition system 
generates data defining a weighting for the words 
associated with said links.

79. A method according to claim 61 or 62 or any claim 
dependent thereon, further comprising the step of 
associating said annotation data with said data file.

within the lattice;

(ii) data for associating a plurality of phonemes of
the phoneme data with a respective plurality of links;
and

(iii) data for associating at least one word with at
least one of said links.

86. A method for storing a data file into a database,
the method comprising the steps of:

combining the data file with annotation data
corresponding to the data file, the annotation data
including phoneme data; and

storing the data file with the annotation data.

87. An apparatus for searching a data file including
annotation data, in response to an input query, the
apparatus comprising:

means for generating phoneme data and word data
corresponding to the input query;

means for searching a data file based on the phoneme
data and/or the word data and the annotation data; and

means for outputting a search result in dependence
upon the result of said searching means.

88. An apparatus according to claim 87, wherein said
annotation data defines a phoneme and word lattice, and
comprises:

(i) data defining a plurality of nodes within the
lattice and a plurality of links connecting the nodes
within the lattice;

(ii) data associating a plurality of phonemes of the
phoneme data with a respective plurality of links; and

(iii) data associating at least one word with at
least one of said links.



89. An apparatus for storing a data file into a

90. A medium for storing a data file, the data file
comprising:

audio data; and

annotation data corresponding to the audio data,
said annotation data including phoneme data.

91. A medium for storing a data file, the data file
comprising:

video data;

audio data corresponding to the video data; and

annotation data corresponding to the audio data, said
annotation data including phoneme data.

92. A medium for storing a data file, the data file
comprising:

text data; and

annotation data corresponding to the text data, said
annotation data including phoneme data.

93. Data including audio data and further comprising

annotation data includes phoneme data.

95. Data including text data, the data further
comprising annotation data corresponding to the text
data, which annotation data includes phoneme data.

96. A data carrier carrying data according to any of
claims 1 to 13 or processor implementable instructions
for controlling a processor to implement the method of
any one of claims 14 to 23 or 59 to 83 or 84 to 86.

97. Processor implementable instructions for controlling
a processor to implement the method of any one of claims
14 to 23 or 59 to 83 or 84 to 86.